A PARALLEL PREPROCESSING FOR THE OPTIMAL 
ASSIGNMENT PROBLEM BASED ON DIAGONAL SCALING 
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Abstract. We present a preprocessing method, which is suitable for parallel computation, to 
solve large optimal assignment problems. We think of the optimal assignment problem as a limit of a 
deformation of an entropy maximization problem. We show that the matrix maximizing the entropy 
converges, as the deformation parameter goes to infinity, to a matrix whose nonzero entries are 
precisely the ones belonging to optimal assignments. For every value of the deformation parameter, 
the matrix of maximal entropy can be computed by Sinkhorn iteration. This leads to a parallel 
preprocessing for the optimal assignment problem, which allows to delete entries that do not belong 
to optimal assignments, so that the reduced problem becomes executable on a sequential machine. 
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1. Introduction. One of the most classical problems in combinatorial optimiza- 
tion is the optimal assignment problem. Several applications of this problem arise in 
different fields of applied sciences such as bioinformatics for protein structure align- 
ment problem |Hol93[ ILCL04] , VLSI design [HCLH90 , image processing and com- 
puter vision [CWC + 96] , and the pivoting problem in the solution of large linear sys- 
tems of equations ON96, DKOOl ILD03] . Thus, this problem has received considerable 
attention and several algorithms have been proposed to solve it. 

The first polynomial time algorithm to solve this problem was proposed by H. W. 
Kuhn in 1955 Kuh55 . It works in 0(n A ) time, which was improved to 0(n 3 ) by 
Edmonds and Karp |EK70j (see also |DK69j ). In the sparse case, Fredman and Tar- 
jan [FT87j proposed an improved algorithm which uses Fibonacci heaps for the short- 
est paths computations. It runs in 0(n(m + nlogn)) time. Several other algorithms 
have also been developed. We refer the interested reader to the recent book of Burkard 
et al. [BDM09] . 

In this paper we exploit the connection between the optimal assignment problem 
and entropy maximization. The latter is well studied in the field of convex optimiza- 
tion |FRT97j . 

The main idea is to think of the optimal assignment problem as the limit of a 
deformation of an entropy maximization problem. More precisely, given an n x n 
nonnegative matrix A = (ay), let us look for an n x n bistochastic matrix X = (xij) 
maximizing the relative entropy 

J P (X):=- xtiQo&frMj) - 1) , (1.1) 

l<i,j<n 
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Here, p is the deformation parameter. We will show in Section [5] that when p 
goes to infinity, the unique solution X{p) = (xj,j(p)) of the entropy maximization 
problem converges to a point X(oo) which is of maximal entropy among the ones 
in the convex hull of the matrices representing optimal permutations. In particular, 
if there is only one optimal permutation, X(p) converges to the matrix representing 
this optimal permutation. In Section |3] we prove that, for X(p) as the solution to 
equation for some value of p and for X(oo) as the solution when p — >• oo, we 

have 

\xij(p) - x i:j (oo)| = 0(exp(-cp)) 

for c > 0. This shows an exponential convergence to the optimal solution when p 
increases. 

The maximal entropy matrix X{p) can be computed by any matrix scaling algo- 
rithm such as Sinkhorn iteration |SK67j or Newton method [KR07 . Subsequently, 
these iterative methods can be used to develop new algorithms to solve the optimal 
assignment problem and related combinatorial optimization problems. 

An interesting application of this new approach, is the solution of large scale 
dense optimal assignment problems. Several efforts have been made to solve this 
problem BT09, L094 . A well-known application arises from the approximation al- 
gorithms and heuristics for solving the Asymmetric Traveling Salesman Problem or 
the Vehicle Routing Problem. There are also some applications in object recogni- 
tion and computer vision. An application to cosmology (reconstruction of the early 
universe) can be found in the work of Brenier et al. [BFH+03 . For a survey on the 
applications of large dense linear assignment problems, we refer the reader to [BT09] . 
Models of large dense random assignment problems are also considered in [MPV, 
Ch. VII] from the point of view of statistical physics. 

Here, we investigate a preprocessing algorithm which can be used to solve large 
scale optimal assignment problems. This preprocessing is based on an iterative 
method that eliminates the entries not belonging to an optimal assignment. This 
reduces the initial problem to a much smaller problem in terms of memory require- 
ments. This is illustrated in Figures fOI and ll~2l 

The idea of this algorithm is to take p large enough, then apply a diagonal scaling 
algorithm to until convergence to a bistochastic matrix X, and finally delete the 
small entries of X. Here, the exponential of leads to numerical overflow for large 
values of p. However, we shall show that it is possible to implement this iteration in a 
numerically stable way. The present algorithm assumes the existence of at least one 
matching, since otherwise, the Sinkhorn iteration may not converge. However, we note 
that matrix balancing (Sinkhorn iteration) can also be used to detect the existence of 
a perfect matching, as shown by Linial, Samorodnitsky and Wigderson [LSW00 . 

We consider two variants of the algorithm, one by using the Sinkhorn iteration 
as the diagonal scaling algorithm and the other one by using Newton iteration. The 
advantage of Sinkhorn iteration is that, it can be efficiently implemented in paral- 
lel ADRU08, DRU08 . Thus we show that for very large dense optimal assignment 
problems the data of which cannot be stored in one machine, the parallel Sinkhorn 
iteration can be used to reduce the size of the problem and then, it can be solved by 
any classical method. On the other hand, the advantage of Newton method is the 
speed of the convergence to bistochastic matrix. 

For both variants, we present several numerical results of various full and sparse 
matrices from gallery of Matlab and The University of Florida Sparse Matrix Col- 
lection. We show that the Sinkhorn iteration can be efficiently used to decrease the 
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size of the dense matrices, up to 99% in a small number of iterations. For Newton 
iteration, we show that it is not only efficient for dense matrices but also efficient for 
sparse symmetric matrices. 

Note also that the present approach yields approximate dual variables, which 
provide an approximate optimality certificate for the assignment which is found (Sec- 
tion gTTJl. 

In the last section, we introduce an iterative method which is based on a modifica- 
tion of the Sinkhorn scaling algorithm, in which the deformation parameter is slowly 
increased (this procedure is reminiscent from simulated annealing, the parameter p 
playing the role of the inverse of the temperature). We prove that this iteration, 
which we refer to as deformed- Sinkhorn iteration, converges to a matrix whose entries 
that belong to the optimal permutations are nonzero, while all the other entries are 
zero. An estimation of the rate of convergence is also presented, but this appears 
to be mostly of theoretical interest since in practice, the convergence of this variant 
appears to be slow. 

2. Entropy maximization and matrix scaling. The diagonal scaling prob- 
lem can be generally defined as finding diagonal matrices D r and D c with positive 
diagonal entries such that the scaled matrix D r AD c has prescribed row and col- 
umn sums. Due to the variety of its applications, this problem has been well stud- 
ied [MS69| B ru74[ ISK67] . A comparison of the proposed algorithms to solve this 
problem, can be found in [SZ90j . A remarkable special case arises when the row and 
column sums of the matrix X = D r AD c are required to be identically one, so that 
X is bistochastic. Then, the following theorem provides a sufficient condition for the 
existence of a diagonal scaling. 

Theorem 2.1 (Sinkhorn SK67 ). Let A be annxn nonnegative matrix with total 
support (every positive entry belongs to a diagonal). Then there exist diagonal matrices 
D r and D c such that D r AD c is bistochastic. Moreover, if A is fully indecomposable, 
then D r and D c are unique up to a constant factor. 

Now, consider the following optimization problem, which consists in finding an 
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n x n bistochastic matrix X = maximizing the following relative entropy 

max J P (X), J P {X) := V)ary&y + p~ 1 S(X), bij = log ay, (2.1) 

ij 

where 

S(X) := -^J log Xjj 

ij 

is the entropy function, p > is a parameter, and -B n denotes the set ofnxn 
bistochastic matrices. The convention x (— oo) is understood when interpreting the 
product Xijbij. 

We shall assume that the matrix A := (ay) has total support, so that the diagonal 
matrices D r and D c are known to exist. We denote by G(A) := | ay > 0} the 

pattern (set of non-zero entries) of the matrix A. 

The general relation between the entropy maximization and scaling problems is 
well known, see e.g. [Sch89 for an overview. We shall need in particular the following 
result. 

PROPOSITION 2.2 (Corollary of |BLN94[ Th. 3.1]). Let A be a matrix with total 
support. Then, the solution X(p) of the entropy maximization problem indicated in 
Eauation \2.1\ is unique and it is characterized by the existence of two positive vectors, 
U and V, such that x^ = a^UiVj for all i,j. 

Thus, the characterization of the proposition shows that X is obtained from the 
pth Hadamard power A^ := (a?-) by a diagonal scaling. 

The previous proposition is a special case of Theorem 3.1 of [BLN94 , which is 
established in a more general infinite dimensional setting (for p — 1; but the result for 
an arbitrary p follows trivially from it). We shall need in the sequel a few elements 
of the proof, which we next include. 

First, the function J p is upper semi-continuous, and B n is compact, hence, the 
maximum of J p over B n is attained. If there is at least one permutation a such 
that X)i ^io-(i) > — oo, the associated permutation matrix X = (xy), with x^j = 1 if 
j = o~(i), and iEy = otherwise, is such that J P (X) > — oo. Then since the maximum 
of J p is attained, its value must be finite. Moreover, since the objective function is 
strictly concave and the feasible set is convex, the point of maximum X(p) is unique. 

We claim that X{p) has the same pattern (set of positions of non-zeros entries) 
as the matrix A. 

To see this, let Y be a bistochastic matrix with the same pattern as A, i.e. 
yij > iff aij > 0. Assume by contradiction that X(p) does not have the same 
pattern as A, so that Xij(p) — and yij{p) > for some Then because the 

right derivative of the function t h-> — tlogt at t = + is infinite, the right derivative 
of t i-> J p (X(p) + t(Y — X(p))) at t = + is easily seen to be infinite, and so, 
Jp(X{p) + t(Y - X(p))) > and X(p) + t(Y - X{p)) e B n hold for t small enough, 
contradicting the optimality of X(p). Hence, the claim is established. 

Consider now the Lagrange function 

L(X, U, V) = Jp(X) + J2 u i(J2 x ^ - X ) + E v i E - ^ ' 

i j j i 

where U — (ui) and V — (vj) are vectors of Lagrange multipliers. The stationarity 
condition implies that if X is an optimal solution of the entropy maximization problem 
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indicated in Equation 12.11 then there must exist two vectors of multipliers U and V 
such that, for all € G(A), 

dL 

= b i:j + logXij) + Ui + v 3 = . 



dxij 



It follows that 

Xij(p) = exp(p(bij + Ui + vj) - 1) , V(i,j) £ G(A) 

showing that X is obtained from the pth Hadamard power := (a^) by a diagonal 
scaling. 

Using the latter characterization of X(p), we observe that: 



We now study the convergence of X(p) as p tends to infinity. We shall consider 
the face F of the polytope of bistochastic matrices consisting of the optimal solutions 
of the linear programming formulation of the optimal assignment problem 



max ) Xijbij = max N b ia u\ 



xeB n * — ' aee 

ij 



Theorem 2.3. As p tends to infinity, the matrix X(p) converges to the unique 
matrix X* maximizing the entropy among the ones that belong to the face F consisting 
of the convex hull of optimal permutation matrices. In particular, if the solution 
of the optimal assignment problem is unique, then X(p) converges to the associated 
bistochastic matrix. 

Proof. Since X(p) is the point of maximum of J p , 

J p (X(p))=Y2xij(p)bij + P - 1 S(X(p)) 

ij 

> j p (x*) = ^4^ + p- 1 s'(x*) 

ij 

= maxVi w(l) +p~ 1 S(X*) 

i 

Consider a sequence (pk)k>i converging to infinity, and assume that X(pk) converges 
to some matrix Z , which must belong to B n . Setting p = p^ in the previous inequality 
and taking the limit as k tends to infinity, we get z ijbij > max CT6 e„ J2i ha(i)i 
which shows that Z belongs to the face F. 
Observe that 

p-\S{X{ Pk )) - S{X*)) = (J Pk (X(p k )) - J Pk (X* j) + X -Y^(Pk)bij 

\ ij ij J 

is the sum of two nonnegative terms, because X(pk) is a point of maximum of J Pk , 
and X* £ F is a convex hull of matrices representing optimal permutations. It follows 
that S(X(pk)) — S(X*) > 0, and so, if Z is any accumulation point of X{pk) as k 
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tends to infinity, S(Z) — S(X*) > 0, showing that Z is of maximal entropy among the 
matrices in F. Since the entropy function is strictly convex, X* is is the only point 
with the latter property, and so every accumulation point of X(pk) is equal to X* , 
showing that X(p) converges to X* as p — > oo. □ 

Corollary 2.4. If there is only one optimal permutation, then X(j>) converges 
to the corresponding permutation matrix. 

3. The speed of convergence. We have already shown in Theorem 12.31 that 
the maximal entropy solution X{p) converges as p tends to infinity, to a matrix X(oo) 
which is a convex hull of optimal permutation matrices. In particular, X(j>) converges 
to an optimal permutation matrix if the optimal permutation is unique. Now, the 
question is how fast is this convergence. This is answered by the following theorem. 

Theorem 3.1. Assume that the matrix A has total support and that log ay £ Q, 
for all (i,j) such that a,ij > 0. Then, there exists a positive constant c such that, for 
all i, j £ [n], 

\Xijip) - x^(oo)| = 0(exp(-cp)) 

To establish Theorem 13.11 recall that a real Puiseux series in the variable t is an 
expression of the form 

f = Y i Ckt k/r (3.1) 

k>k 

where r £ N is positive, k £ Z, Ck £ K for all k, and the sum is taken over all k £ Z 
such that k > k. We denote by R CV g{{i}} the set of real Puiseux series that arc 
absolutely convergent for all t of small enough positive modulus. 

Lemma 3.2. For all i,j £ [n], there exists a Puiseux series of the form (|3.1I) . 
such that 

x ij{p) = /(exp(-p)) = 22 c fe exp(-pk/r) 
k>k 

the latter series being absolutely convergent for all large enough p. 

In order to establish this result, we shall use some tools from the theory of real 
ordered fields, for which we refer the reader to |BPR061 chapter 2]. 

Let us consider the following statement: if a nonnegative matrix A has total 
support, then there exists a unique nonnegative matrix X with row and column sums 
1, and there exist diagonal matrices D and D' with positive diagonal entries such that 

A = DID' . 

According to Sinkhorn's theorem |SK67j and to Proposition ^. 2[ this statement is true 
when the entries of A, X, D, D' belong to the field of real numbers. Moreover, this 
statement belongs to the first order theory of the real closed field (K, +, x , 0, 1, >). By 
Tarski's theorem [TaxSlj . any first order statement that is valid in a special real closed 
field must also be valid in any real closed field. In particular, the above statement 
holds over the field of convergent real Puiseux series, which is known to be a real 
closed field. Indeed, the fact that formal Puiseux series constitute a real closed field 
is standard, the proof that the same is true in the case of convergent Puiseux series 
can be found in |BK761 § 10]. 
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Thus for a matrix A(t) g R cvg {{t}} nxn with total support, there exists diago- 
nal matrices D(t), D' (t), £ R cvg {{t}} nxn together with a unique bistochastic matrix 
X(t) e M cvg {{i}}" x " such that A(t) = D(t)X(t)D'(t). 

We choose now the matrix A(t) = (dij (t)) such that a%j (t) = i log aij where log a%j G 
Q. Then, the entries of the corresponding matrix X(t) have the form 

+00 

and this series is convergent for a suitably small positive t. Make now the substitution 
t = exp(-p). We deduce that for all suitably large p, 

+00 

x ij(p) = Ci-jkexp^pk/nj) . (3.2) 

Since x(p)ij has a finite limit as p — > 00, understanding that kij is the first index k for 
which the coefficient Cijk is non-zero, we necessarily have kij > 0, so that Xij(oo) can 
be identified to the constant term in the latter series. Setting c = mim^- (kij + 
we get 

\xij[p) - Xij{oo)\ = (9(exp(-cp)) , 

which proves Theorem 13. II 

Remark 3.3. The assumption that loga.^ e Q in Theorem 13.11 is inconvenient. 
It could be avoided by replacing the field of converging Puiseux series by a field of 
converging generalized Dirichlet series, along the lines of [Mar]. However, this would 
require working out the convergence issues which are not treated in [Mar] . 

Remark 3.4. The formulation (|2.ip is somehow reminiscent of interior point 
methods, in which the entropy S(X) — — . Xij log 2;^ is replaced by a log-barrier 
function (the latter would be J2ij^°S x ij m the present setting). The present X(p) 
thought of as a function of p — > 00 is analogous to the central path, and as does 
the central path, X(p) converges to a face containing optimal solutions. However, the 
entropy S(X) does not satisfies the axioms of the theory of self-concordant barriers on 
which the analysis of interior point methods is based. Indeed, the speed of convergence 
in 0(exp(— cp)) appears to be of a totally different nature by comparison with the 
speed of 0(1 /p) observed in interior point methods |NN94j . 

Example 3.5. The constant c appearing in Theorem 13. II can be small if there 
are several nearly optimal permutations, and then a large value of p may be needed 
to approximate X(oo). However, in such much smaller value of p turns out 

to be enough for the method described in the next sections, the aim of which is to 
eliminate a priori entries not belonging to (nearly) optimal permutations. This is 
illustrated by the following matrix, in which the identity permutation is optimal, and 
the transposition (1,2) is nearly optimal: 

/ 1 0.99 0.99\ 
A = 0.99 1 1/3 . 
\0.25 0.5 1 / 

For p = 10, we have the following matrix, the significant entries of which indicate 
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precisely the optimal and nearly optimal permutations: 

0.5195148 0.4595136 0.0210196\ 
0.4804643 0.5195864 0.0000004 . 
0.0000209 0.0209000 0.9789800/ 

The convergence of X(p) to X(oo) is illustrated in Figures [3.11 Observe that the 
graph of logXij(p) as a function of p is approximately piecewise affine. In fact, each 
piece corresponds to a monomial in the Puiseux series expansion (|3 . 2[) (see [VirOlj 
for an explanation of this fact). The path p H> X(p) converges quickly to the face 
containing the two nearly optimal permutations and slowly to the unique optimal 
permutation. 




Fig. 3.1. The variation of log 10 xia(p) as a function of p. 



Remark 3.6. Finding an explicit formula for the speed of convergence c appears 
to be an interesting combinatorial problem (which is beyond the scope of this paper). 

4. Preprocessing for the optimal assignment problem. For a fixed p > 0, 
the solution for the entropy maximization problem displayed in Equation (I2.1[) can 
be computed by any scaling algorithm such as Sinkhorn iteration or Newton method. 
Using Theorem 13. 11 it can be seen that if the original matrix has only one optimal 
permutation, the order of magnitude of all the entries which belong to the optimal 
permutation will be 1 ± 0(exp(— cp)) while the order of magnitude of all other entries 
are (9(exp(— cp)). As an example, consider the following 5 by 5 random matrix with 
the bold entries belonging to optimal permutation. 

/0.292 0.502 0.918 0.281 0.686\ 

0.566 0.437 0.044 0.128 0.153 

A= 0.483 0.269 0.482 0.778 0.697 

0.332 0.633 0.264 0.212 0.842 

\0.594 0.405 0.415 0.112 0.406 / 

By applying the Sinkhorn iteration on A 1 -- 50 ^ the following matrix can be computed. 



X(50) 



/3.4J5- 
4.8E 
2.5E 
1.5E 



27 
02 
13 
23 



\ 9.5E-01 



1.5JS-08 
9.4E-01 

4.6E- 19 
1.2.E-02 
4.1E-02 



1.0E+00 

4.6E- 56 
9.3.E- 12 
6.2£-27 
6.2£-07 



7.4E - 26 
4.0E- 32 
1.0E+00 
4.3.E-31 
1.0E — 34 



4.7 E - 06\ 
7.9E - 28 
1.0S-02 
9.8E-01 

2.3^-06/ 
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Thus, for sufficiently large values of p, when X(p) is an e— bistochastic matrix, meaning 
that some distance between X{p) and a bistochastic matrix is less than e, one may 
delete all the small entries which are less than a threshold t, chosen consistent with 
e, while keeping all others. In this way the size of the original problem in terms of 
memory requirements will be reduced to a much smaller one. 

For a column (row) stochastic matrix, that is a matrix for which the sum of 
all columns (rows) are one, the distance to the set of bistochastic matrices will be 
measured by max^ |rj — 1| where r,; indicates the ith row (column) sum. 

Determining the coarsest accuracy e and the maximal threshold t which are needed 
to find an optimal permutation would require to know the maximal entropy solution 
X(oo) characterized in Theorem 12.31 This information is in general not available. 
However, the worst case can be considered to be the one where A(oo) is uniform, 
with all entries equal 1/n (and n\ optimal permutations). Since we need to preserve 
the optimal permutations, this leads to a conservative choice e = t = 1/n which we 
adopted in the present experimental results. The choice of the value of p will be 
discussed in Section 14.1.21 This leads to Algorithm Q] 

Algorithm 1 An optimal assignment preprocessing for fixed p 
input: A,p 

n 4— size(A, 1) 
e,t «— 1/n 

comment: Prescaling 
if > e then 

mm(A) 

1 log(min(A)) 

m <- r-, ,/ w . c <- e lo s(— (^)/™(^)) 

log(max( J 4)/ mm(A)) 7 

A <- ±A< m ) 

c 

else 

A <- -^jtsA 

min(A) 

end if 

B <- A&) 

comment: Main loop 
repeat 

apply one iteration of any diagonal scaling algorithm to B so B <— DB'D, where 
D, D' are diagonal matrices 
until B is e— bistochastic 

Delete all the entries of B which are less than a threshold t 



The naive computation of A^ is numerically unstable for large values of p. This 
can be avoided by the prescaling step in Algorithm [TJ Then, we set max(A) = 
maxjj dij, min(A) = min 0j >o <%•• By applying this prescaling, all the nonzero scaled 
entries will be placed in the [1, e] interval. In the case when max(A)/min(A) > e, the 
prescaling has another interesting property, that is, the scaled matrix is invariant by 
any entrywise power of the input matrix. In other words, if we apply the prescaling 
to the matrix A^ q \ for all q > 1, the matrix obtained after the prescaling step turns 
out to be independent of the choice of q. When < e the entries of A have 

already been located in the interval min(A)[l, e], then we do not need to perform the 
previous prescaling since the denominator in the formula defining m will be small if 
max( A) is close to min( A) . We shall also see in Section 14.1.11 that the iterations can 
be implemented robustly for large values of p by working with log-coordinates. Next, 
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we provide more details on the proposed algorithm. 

4.1. Sinkhorn iteration. A simple way to compute the diagonal matrices D, D' 
is Sinkhorn iteration [SK67 . This algorithm starts from a given matrix A, divides 
every row by its sum, then every column of the new matrix by its sum, and so on, until 
the matrix obtained in this way converges to a bistochastic matrix. The advantage of 
this algorithm is that, it can be efficiently implemented in parallel |ADRU08] and it 
can be applied to any non-negative matrix which has at least one nonzero permutation. 
The disadvantage is that, it is generally slower than other methods. 

Recall first that the open cone C = {x g K™ : n > 0,Vi} consisting of positive 
vectors of R™ is equipped with Hilbert's projective metric, defined by 

d(x,x') — log max ' J 

«,j X, t Xj 

Note that d(x, x 1 ) is zero if and only if the vectors x and x' are proportional. We refer 
to [BR97, § 6] for more background. In particular, if A is a positive matrix, a theorem 
of Birkhoff shows that the map x i— > Ax is a contraction in Hilbert's projective metric, 
with a contraction rate 

r d(Ay,Ay') , „ , . 9(A) 1 / 2 - 1 

k(A) := sup{— — :y,y €C,y,y non proportional} = , , 

d(y,y') 0(A) 1 / 2 + 1 

where 



0(A) = expsnp{d(Ay, Ay') : y,y' 6 C} = max 



i,j,p,l a,j r au 

The following result is a consequence of this theorem. 

Proposition 4.1 (Franklin and Lorenz }FL89j ). For a positive matrix A, the 
global rate of convergence of Sinkhorn iteration is bounded above by k(A) 2 . 

This general bound is applicable only for positive matrices and it can be coarse 
in practice. Recently, Knight Kni08 provided a local rate of convergence. Due 
to his work, for classical Sinkhorn iteration the local rate of convergence of a fully 
indecomposable matrix, is bounded by erf where o~% is the second singular value of 
the bistochastic matrix to which the iteration converges. Hence, the following result 
allows us to estimate the local convergence rate of Sinkhorn iteration, as p — > oo. 

Proposition 4.2. Assume that there is only one optimal permutation. Then, 
there is a constant c > such that 

1 — 0(cxp(— cpj) < a^(X(p)) < 1 asp — » oo 

Assume now that the matrix X(oo) is fully indecomposable (which implies that there 
are several optimal permutations). Then. 

a 2 (X(p)) ->• <t 2 (X(oo)) < 1 as p -s> oo . 



Proof. Due to the perturbation theorem of Mirsky |Mir60| . for any unitarily 
invariant norm ||.|| and n x n matrices, X and X with singular values o~\ > o~2 > • ■ ■ > 
a p and (fi > <T2 > . . . > cfp respectively, we have, 

|| diagfa- o-i)\\ <\\X-X\\. 
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So, for X(p) and X(oo), 

\a 2 (X(p)) - a 2 (X(oo))\ < \\X(p) ~ X(oo)|| 2 < 0(exp(-cp)) 

for which the constant c depends on the coefficients of the Puiseux series and pos- 
sibly on the dimension of X(p). Thus, if the original matrix has only one optimal 
permutation, <72(X(oo)) = 1 which implies that 

l-0(exp(-cp))<<7 2 (X(p)) 

Moreover according to the Birkhoff-von Neumann theorem [Bir46 , for any norm ||.| 
on ffi™ which is invariant under permutation of the coordinates and for any bistochastic 
matrix X, \\X\\ = 1 and subsequently 

1 - 0(exp(-cp)) < a 2 (X(p)) < 1 

When X(oo) is fully indecomposable, since the multiplication of two fully inde- 
composable matrices is also fully indecomposable, M = X(oo)X T (oo) is fully inde- 
composable. Note also that for all 1 < i < n, ma — X)j=i x % > ®> wmcn implies that 
M is primitive. Then, according to the Perron-Frobenius theorem, all the eigenvalues 
of M distinct from p(M) have a modulus strictly smaller than p(M) = 1 which yields 

<7 2 {X{00)) < 1. □ 

4.1.1. Logarithmic p-Sinkhorn iteration and approximate optimality 
certificates. As it was discussed before, computing the pth. Hadamard power of 
A may cause some numerical difficulties. To avoid this problem a prescaling has been 
proposed, after which all the matrix entries are in [1, e] interval. A theoretical disad- 
vantage of this prescaling is that the increase of p is limited since e p < I, where I is 
the largest number, in the numerical range. However, we next give a log-coordinate 
implementation of Sinkhorn iteration which avoids this limitation. This will provide 
as a by product a certificate allowing one to check the approximate optimality of a 
permutation. 

Let A £ W ixn be a real non- negative matrix which has total support. For a given 
p, consider the following iteration for a sequence of vectors Uk,Vk € R ra 

V = 1 (4.1) 
U k+1 =l(A^V k ) (4.2) 
V k+1 =l(A^ T U k+1 ) (4.3) 

where 1 is a vector [1, 1, . . . , 1] T of dimension n and I is an operator which inverses 
the entries of a vector. 

Proposition 4.3. For a nonnegative matrix, A, which has total support, the 
iteration defined by Equations \4-l\ and \4-3\ coincides with Sinkhorn iteration. 

Proof. Let W k and Z k respectively, be column scaled and row scaled matrices 
defined as the following: 

W k = diag(t/fc)AW diag04) 
Z k = diag(U k+1 )AW diag(V fe ) 

Also, let C denote the column scaling operator in which all the columns of a matrix 
are divided by it's sums and 1Z be the similar operator for rows. It is easy to verify 



12 



M. Sharify, S. Gaubert and L. Grigori 



that, 1Z(DB) = 1Z(B) and C(BD) = C(B) for any diagonal matrix D. According to 
the definition 

Z k = H{A<*) Amg{V k )) = n(di ag (U k )A^ dmg(V k )) = TZ(W k ) 

a similar statement can be proved for W k , that is, Wk = C(Zk) which completes the 
proof. □ 

Assume that U k = (u k ) = p~ 1 \ogU k and V k = (vf) = p _1 logT4, then, the 
logarithmic form of this iteration can be written as: 

u k+1 = -- log expp(log a l3 + v$) 

j 

v- +1 = -- log Vexpp(log ajJ + 

3 

Let 

£ij = log ay + Vj — max (log ay + Vj) 

yji = logaji + u k+1 - max (log + Uj +1 ) 

for which £y, f/ji < 0. The logarithmic iteration can be reformulated by using and 
jjji as the following: 

u k +1 = - max (log dij + v k ) - - log exppx y (4.4) 

j 

v k+1 = - max (log dji + u k+1 ) log V" exppyu (4.5) 

3 J P 

3 

The last iteration can be computed for a sufficiently large p, without having numerical 
difficulties. We note that a related trick was used by Malajovich and Zubelli pVlZOl 
in a different context. 

Proposition 4.4 (Approximate optimality certificate). Let U,V and X be pro- 
duced by the p-Sinkhorn iteration. Also, let Ci ■— l°gSj exppaiy and let Val(OAP) 
denote the logarithmic of the value of an optimal permutation. Then, 

n n n 

Val(OAP) <-^>-X>i-EC* • ( 4 -6) 

i—l J = l i—1 

Proof. Observe that at each step of the Sinkhorn iteration: 

log aij + v k < -u k+1 - Ci , l<i<n 

Let a denote an optimal permutation. Choosing j = o~(i) in the previous inequality, 
and summing over 1 < i < n, we get (|4.6|) . □ 

In practice, this proposition will be used to check the validity of the preprocessing, 
by comparing the logarithm of the value of the permutation which is eventually found 
with the upper bound (|4.6[) . 
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4.1.2. Experimental results. The experiments which are presented here have 
been obtained by using Sinkhorn iteration in Algorithm U as a diagonal scaling 
method. We used Matlab version 7.10.0. The detailed Matlab implementation of 
the algorithm is presented below. 

Finding the best value for p seems to be tricky since increasing p yields a slow 
convergence and at the same time, it yields the lower percentage of remaining entries. 
This fact also can be seen in Figures 14.11 14.21 which illustrate the percentage of the 
remaining entries and the required number of Sinkhorn iterations, for several values 
of p for the "lotkin" 1000 by 1000 matrix from the gallery of Matlab. 



Matlab code for p-Sinkhorn iteration 



function [it , A] =psinkhorn(A) 
n=size (A , 1) ; 
t=l/n; 
p=100; 

Min=min(A(A>0)) ; 
Max=max(A(A>0)) ; 

if (Max/Min)>exp(l) °/ prescaling 
m=l/(log(Max)-log(Min) ) ; 
c=exp (log (Min) / (log (Max) -log(Min) ) ) ; 
A=(l/c)*(A.~m) ; 

else 

m=l/log(Max) ; 
A=A. ~m; 

end 

A=A.~(p) ; 

d=(l/n)+l; 

it=0; 

while (d> 1/n) "/.main loop 

A=diag (sparse ( (A*ones(n, 1) ) . " (-1) ) )*A; 
A=A*diag (sparse ( (A'*ones(n, 1) ) . " (-1))) ; 
d=max(abs (sum(A' )-l) ) ; 
it=it+l; 

end; 

[indx , indy] =f ind (A>t ) ; 
A=sparse (indx , indy, 1 ,n,n) . *A; 

end 



In the following experiments, we set the parameter p to 100 which leads to a 
reasonable decrease in the size of the problem and generally does not yield to a 
slow convergence, however it could be any reasonably large value. Recall that the 
convergence is measured by max^ \n — 1|, where denotes the ith row (column) 
sum for a column (row) stochastic matrix. Table 14.11 displays the results for dense 
matrices from the gallery of test matrices of Matlab. For these experiments the 
dimension is 5000. The columns from left to right are: gallery name, number of 
nonzeros, number of iterations, the logarithmic value of optimal assignment and the 
percentage of remaining entries after deleting small entries. The same results are 
also presented for a random matrix, referred to as '"rand" '(the random function of 
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P P 

Fig. 4.1. The number of iterations as a Fig. 4.2. The percentage of remaining en- 
function of p. tries as a function of p. 



Matlab) and an Euclidean random matrix referred to as '"Euclidean"'. The latter, 
which is of interest in statistical physics, is a matrix whose entries are functions of 
random points in an Euclidean space |Par02j . More precisely, we draw at random 2n 
points X\, . . . , x n \ 2/1, . . . , y n uniformly in the unit cube of K 3 . Then, we consider the 
matrix A = (ay) where ay = exp(— d(x{, yj)) and d is the Euclidean distance. In 
this way, a permutation a which maximizes ]X =1 aij is the same permutation which 
minimizes the distance between these two sets of points. 

Table 4.1 

Sinkhorn iteration for dense matrices from the gallery of test matrices of Matlab and for random 
and random Euclidean distance matrices 



Gallery 


nnz 


No. it. 


Val(OAP) 


Rem. En.(%) 


cauchy 


25000000 


79 


4.54725B + 00 


47.95 


minij 


25000000 


47.3 


1.25025E + 07 


26.57 


moler 


25000000 


304 


4.99950B + 07 


28.43 


orthog 


25000000 


304 


4.99950B + 07 


28.43 


pei 


25000000 


1 


5.50000B + 04 


00.02 


prolate 


25000000 


42 


2.00000B + 03 


00.66 


randcorr 


25000000 


1 


5.00000B + 03 


00.02 


toeppd 


25000000 


1 


1.24767.E + 07 


00.02 


chebvand 


24997500 


2 


5.00000B + 03 


38.67 


circul 


25000000 


1 


2.50000B + 07 


19.48 


cycol 


25000000 


3 


1.73422B + 04 


13.23 


lotkin 


25000000 


7.3 


5.54715B + 00 


48.59 


rand 


25000000 


2 


4.99837B + 03 


28.38 


Euclidean 


25000000 


417 


4.77693.E + 0.3 


01.49 


chcbspcc 


25000000 


1084 


5.33411B + 07 


01.98 


lchmcr 


25000000 


3537 


5.00000B + 03 


18.58 


gcdmat 


25000000 


11174 


1.25025E + 07 


00.06 



As Table |4~T1 shows . For more than 58% of the cases, the algorithm converges very 
fast (in less than 80 iterations) and for 82% of the cases the algorithm converges in less 
than 500 iterations (which is less than 0.1 of the dimension of the input matrix). Also 
for more than 41% of the cases the original problem reduced to a new problem which 
has less than 2% of the original entries and in 82% it reduces to a new problem with 
less than 30% of the input entries. Since, the Sinkhorn iteration can be implemented 
in parallel, this method can be efficiently applied to large dense optimal assignment 
problems as a parallel preprocessing to reduce the size of the original problem. 

We also tested several sparse matrices from The University of Florida Sparse 
Matrix Collection. The results show that using Sinkhorn iteration as a diagonal scaling 
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method in Algorithm [T] generally makes a slow convergence for sparse matrices. 

4.2. Newton iteration. Solving the diagonal matrix scaling problem by using 
Newton iteration has been considered first in the work of Khachian and Kahalan- 
tari |KK92j for positive semidefmite symmetric matrices. They have considered the 
more general problem of finding a positive zero of the mapping 

f(x) ^b + Ax-x- 1 

where A is a given matrix of dimension n and b is a fixed n— dimensional vector. They 
proposed a path-following Newton algorithm of complexity 0(^/nL) where L is the 
binary length of the input. 

Recently, Knight and Ruiz have considered a Newton algorithm for nonnegative 
matrices |K R07] . For a symmetric matrix A, they considered the diagonal matrix 
scaling problem as finding a vector x such that 

f(x) = D(x)Ax -1 = 

where D{x) = diag(x). If A is nonsymmetric, then the following matrix will be 
considered as the input of the algorithm. 

o) 

They showed that the Newton iteration can be written as 

AkXk+i = Ax k + D(xkY 1 t 

where Ak = A + D(xk)^ 1 D(Axk)- Thus in each iteration a linear system of equa- 
tions should be solved for which they used the Conjugate Gradient method. In the 
nonsymmetric case, the latter linear system is singular, however it is proved that the 
system is consistent whenever A has support (A > has support if it has a positive 
diagonal). Our experiments which will be presented later shows that, the method 
works fast for dense nonsymmetric matrices. However with the default tuning pa- 
rameters, it does not work fast in sparse nonsymmetric cases. More details and the 
exact implementation of this method can be found in [KR07] , Here, we used the later 
method in Algorithm Q] to find the scaling matrices. We also set the parameter p to 
100 which is the same as Sinkhorn iteration. 

In the following tables, No. it. denotes the total number of operations, each of 
them takes 0(n 2 ) time to be done. This includes all the iterations of Conjugate 
Gradient method for each Newton step. Tables 14.21 and 14.31 show the results for 
dense symmetric and nonsymmetric matrices with dimension 5000. For both cases 
the algorithm converges rapidly in a small number of iterations. The percentage of 
the remaining entries is reasonably less than the original problem. In fact, in more 
than 38% of the cases, the original problem reduced to a much smaller problem which 
has less than 2% of the original entries and in 72% of the cases the problem reduces 
to a problem with less than 30% of the original entries. 

Tables 14.41 and 14.51 show the result of this algorithm on several sparse symmetric 
and nonsymmetric matrices from The University of Florida Sparse Matrix Collection. 
These results show that the algorithm generally works very well for sparse symmetric 
matrices while the convergence for sparse nonsymmetric matrices is not fast. 
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Table 4.2 

Newton iteration for dense symmetric matrices 



Gallery 


nnz 


No. it. 


Val(OAP) 


Rom. En.(%) 


cauchy 


25000000 


156 


-4.10569B + 04 


47.95 


ficdlcr 


24995000 


175 


3.91202B + 04 


35.73 


gcdmat 


25000000 


152 


3.75911B + 04 


00.06 


lchmcr 


25000000 


166 


0.00000B + 00 


18.58 


minij 


25000000 


167 


3.75911B + 04 


26.57 


molcr 


25000000 


167 


4.45149B + 04 


28.43 


orthog 


25000000 


164 


-1.9561.E + 04 


48.10 


pci 


25000000 


151 


1.19895B + 04 


00.02 


prolate 


25000000 


155 


-4.58145B + 03 


00.66 


randcorr 


25000000 


151 


0.00000S + 00 


00.02 


toeppd 


25000000 


151 


3.91132B + 04 


00.02 



Table 4.3 

Newton iteration for dense nonsymmetric matrices 



Gallery 


nnz 


No. it. 


Val(OAP) 


Rem. En.(%) 


chcbspcc 


25000000 


251 


4.03274B + 04 


01.98 


chebvand 


24997500 


166 


-1.19254B - 03 


38.67 


circul 


25000000 


161 


4.25860B + 04 


19.48 


cycol 


25000000 


162 


6.19386S + 03 


11.81 


lotkin 


25000000 


257 


-4.10477B + 04 


48.59 


rand 


25000000 


164 


-1.63137B + 00 


28.39 


Euclidean 


25000000 


314 


-2.30779.E + 02 


01.49 



5. Deformed Sinkhorn iteration. In the previous section, we computed X(p) 
for a fixed value of p. However, it is natural to develop a "path following method" 
in which the value of p is gradually increased in the course of Sinkhorn balancing 
iterations. In this section we propose such an algorithm. We prove that if the matrix 
A has support (A has support if it has a positive diagonal) , and if the growth of p is 
moderate enough, then the sequence of matrices produced by the algorithm converges 
to a point which belongs to the face generated by optimal permutations. 

5.1. Definition. Let A e R nx ™ be a real non-negative matrix. Consider the 
following iteration which is a standard Sinkhorn iteration with a deformation of using 
a sequence p m which goes to infinity. 

U m+1 =l(A (p ^V m ) 
V m+1 = l(A^+^ T U m+1 ) 

Let W m+ i and Z m respectively, be column scaled and row scaled matrices defined as 
the following: 

W m+1 = di ag (U m+1 )A ( ^ dmg(V m+1 ) 

Z m = diag(!7 m+ iM (Pm+l) diag(K0 (5.1) 



Proposition 5.1. For a diagonal matrix D, real matrices B, C and the matrices 
W mi Z m in the iteration, the following properties hold. 

1. 1Z(C o (DB)) = 1Z(C o B) where o indicates the Hadamard product 

2. W m = C(Z m _i) 

3. Z m =Tl(W m oA^ + i-Pm)) 
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Table 4.4 

Newton iteration for sparse symmetric matrices 



Gallery 






No it 


V el l y y^JI\r ) 


xxein. Ejii. i /o ] 


2cubcs_sphere 


1 m a oo 
iui4yz 


1 fi 1 "70KA 
104 I Z04 


155 


i oGfi/i c; z? i nfi 

i.zyo4o.c/ -+- uo 


95.91 


Andrews 


OUUUU 


7fim ^/l 
1 OU104 


101 


1. 4uZUZ.fi -+- UO 


fl7 QQ 

u i .oy 


apachc2 


71 1 7fi 


A Q 1 7Q7n 
4ol iOiV 


155 


c CC1 CC 77 1 _|_ nfi 
O.OOlOO-tv 4- UO 


26. 18 


Doncoui 


127224 


OOIDDUZ 


IDO 


1 1 ^fi99 TP _L nfi 
1.1O0ZZ.C/ -+- UO 


no qi 

U z . 1 


dcnormal 


oy4UU 


1 100ZZ4 


10O 


9 BQQ70 F i n£ 
— z.ooo/yfi -+■ UO 


H7 7Q 


Dubcova3 


1 A fifiSO 


'ifi'ififi/l "3 
0000040 


ioy 


S ^1 SO TP _1_ n^ 

— o.ooioy-C/ -\- uo 


AP. ^."7 
40.0 I 


ecology 1 


i nnnnnn 
1UUUUUU 


4yyouuu 


i 

lOO 


O.014y4£/ 4- UO 


20.02 




1U040 / 


97n71 7Q 

z / u / 1 / y 


101 


7 nini i tp _i_ n^ 
— * . UlU 1 1 -C/ ~r UO 


7Q 

/ y.yo 


nnanoiz 


74752 


KOfiOQ9 

oyoyyz 


151 


1 .Uo4 ilil f UO 


19.67 


G2_circuit 


150102 


79fifi7/l 


153 


fi ^Q/iQfiP 1 _i_n^ 

D.0040Dfi 4- UO 


41.77 


^ja/\srio 


61349 


ooolouy 


162 


9 QOORC FT _|_ n^ 
z.ozzOo-C/ 4- UO 


28.82 


gas_scnsor 


fifiOl 7 

ooyi ( 


1 / Uoooo 


i fin 
10U 


/i cQoriQ TP _i_ nc; 
— 4.oyoUofi -+■ UO 


on 

yu.o i 


H20 


O / UZ4 


ZZ10 f OO 


10O 


"3 nai /lop _l nc; 

O.Uol4y£/ -+- UO 


m no 


ncmizauo 


oyzzo / 


97/1 1 Q'i ^ 

Z / 4iyoo 


lOo 


K m n9fi I? _i_ nc; 
O.UlUZO-tv 4- UO 


i/i 1 1 

14. ol 


ijin 


9 t;<;nnn 
zOOU UU 


i 7fifi/i nn 
1 / O04UU 


lOo 


i ar\x. r >fi TP _i_ nfi 
1.0U0Z0.C/ 4- UO 


1/1 /I Q 
14.4y 


nasasrb 


040 / U 


9fi77"39/1 
ZD f / OZ4 


1 fil 
101 


S ^fi/17^P _L n^ 
O.OD4/O.C/ UO 


fi9 "i7 
OZ.O I 


offshore 


259789 


4242673 


161 


4.84144i? + 06 


99.87 


parabolic_fcm 


525825 


3674625 


153 


-4.83938.E + 05 


71.46 


qa8fm 


66127 


1660579 


153 


-5.51168S + 05 


03.98 


rail_79S41 


79841 


553921 


151 


-8.54968S + 05 


15.09 


s3dkq4m2 


90449 


4427725 


161 


5.21115E + 04 


73.77 


shallow_water2 


81920 


327680 


151 


1.95771£ + 06 


25.00 


ship_003 


121728 


3777036 


161 


3.05969^ + 06 


85.85 


shipscc8 


114919 


3303553 


164 


1.94819E + 06 


82.96 


t3dh_e 


79171 


4352105 


156 


-1.28870E + 06 


27.32 


thermomcch_TK 


102158 


711558 


151 


4.85968E + 05 


15.49 


tmt_sym 


726713 


5080961 


158 


1.00529E + 06 


71.46 


filtcr3D 


106437 


2707179 


161 


-7.01011E + 05 


79.95 


G3_circuit 


1585478 


7660826 


153 


6.72048£; + 06 


72.19 


H20 


67024 


2216736 


153 


3.08149S + 05 


03.02 


Si02 


155331 


11283503 


153 


7.14208S + 05 


17.34 


thcrma!2 


1228045 


8580313 


154 


1.63908S + 06 


80.32 



Table 4.5 

Newton iteration for sparse nonsymmetric matrices 



Gallery 


n 


nnz 


No. it. 


Val(OAP) 


Rem. En.(%) 


af'23560 


23560 


460598 


2248 


8.74776B + 04 


70.32 


baycr04 


20545 


85537 


183275 


-5.45190B + 04 


80.21 


bbmat 


38744 


1771722 


2234 


4.73786B + 04 


32.83 


ecl32 


51993 


380415 


23389 


-2.73185B + 05 


81.66 


g7jac200sc 


59310 


717620 


47245 


3.93891.E + 04 


86.40 


gematll 


4929 


33108 


2780 


4.07095.E + 03 


84.70 


grahaml 


9035 


335472 


4014 


-1.84675B + 04 


51.59 


hcircuit 


105676 


513072 


34980 


-3.83585B + 05 


88.59 


hydrl 


5308 


22680 


73772 


5.25311E + 03 


78.65 


jpwh.991 


991 


6027 


151 


1.47688E + 03 


16.44 


lhr71c 


70304 


1528092 


2227871 


-7.63013B + 04 


83.56 


mahindas 


1258 


7682 


3485 


-6.49190B + 01 


31.71 


onetonel 


36057 


335552 


23601 


1.13220E + 05 


87.97 


onetone2 


36057 


222596 


24122 


1.13220E + 05 


85.64 


orani67S 


2529 


90158 


5073 


-1.57076B + 02 


05.20 


shcrman3 


5005 


20033 


168 


-2.62102B + 04 


85.63 


shcrman5 


3312 


20793 


1696 


6.67064B + 03 


29.55 



Proof. Wc only prove the last one since others are straightforward. 

Z m = TZ(A^^ diag(V m )) 

= TZ(A^ diag(K„) o A^^p^) 

= ll((di a g(U m )A^ diag(t/ m )) O ^(Pm+l-Pm)) 
= U{W m O ^(Pm+l-Pm)) 
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According to the previous proposition, we define the following iteration which we 
refer to as deformed Sinkhorn iteration. 

W =C(A {po) ); 

W m = C(Z m ^), Cm = (Z m _! T )l (5.2) 
Z m = K(W m o A ( - Pm+1 ~ Pm ^), r m = (W m o A^-^)l (5.3) 

Here, r mi c m respectively are the vectors of row sums and column sums. 

5.2. Convergence to optimal assignment. For an input matrix, A — (oij)> 
assume that the deformed Sinkhorn iteration converges to a bistochastic matrix. De- 
fine the weight of a permutation, a, with respect to A, to be uio-(A) = J^J j a ia .^y If A 
has a support, it should have at least one optimal permutation as a op t with nonzero 
weight. It is evident that a op t is the optimal permutation for all the matrices W m and 
Z m produced by each deformed Sinkhorn iteration. Observe that for all permutations 
a and n, the ratio is invariant if we multiply the matrix A by diagonal matrices. 
So it follows from the Equation 15. II that 

Thus, for all non optimal permutations such as a, ij^-^z™) w ^ u converge to zero when 
p m — > oo. Since in each iteration the weight of optimal permutation, uj aopt (Z m ), is 
bounded above by 1, the weight of all non optimal permutations will converge to zero 
which yields the following lemma. 

Lemma 5.2. Assume that the deformed Sinkhorn iteration converges to a matrix, 
Z, produced by the deformed Sinkhorn iteration when p rn — > oo. If the original matrix 
A has a support, then all the permutations of Z have zero weight, except the optimal 
permutations of the original matrix A. 

Due to the theorem of Birkhoff-von Neumann, a square bistochastic matrix in R 
is a convex combination of permutation matrices. Hence, all the nonzero entries of 
a bistochastic matrix belong to a permutation with nonzero weight. This statement 
together with the previous lemma yield the following theorem. 

Theorem 5.3. For a non-negative matrix A which has a support, as p m — > oo, if 
the deformed Sinkhorn iteration converges to a matrix X , then all the nonzero entries 
of X belong to an optimal permutation of the original matrix. 

5.3. Convergence to bistochastic matrix for positive matrices. Recall 
that the rate of convergence of the classical Sinkhorn iteration is bounded above by 
k(A) 2 where n{A) = g|^ 1/2 ~^ . The following theorem presents the main result of 
this section: 

Theorem 5.4. Let A be a positive matrix. If p m — a\og(m + 1) where < 
alog# < 2, then the deformed Sinkhorn iteration will converge to a bistochastic matrix 
and subsequently to a solution of optimal assignment of the original matrix A. 

The proof relies on the next lemmas. For a matrix A, 8(A) = 9(A T ), and for two 
diagonally equivalent matrices such as A and B, 9(A) — 9(B). 

Lemma 5.5. For positive matrices A and B, diagonal matrix D, and d(x,x') the 
Hilbert projective metric, the following properties hold. 

1. d(Ax,Ax') < n(A)d(x,x') 

2. d((A o B)x, x>) < log + d(Ax, x>) 
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3. k(AD oB) = k(A o BD) = k((A o B)D) = k(D(A o B)) = k(A o B) 
Proof. The proof is straightforward. □ 

Corollary 5.6. k(A) is invariant under 1Z or C operators. 
Lemma 5.7. Let W m and Z m be the matrices in Equations \5. 215. S\) at iteration 
to. The following properties hold. 

1. n{Z m ) = k(A&™+i)) 

2. K{W m ) = K(A^) 

Proof. The proof is straightforward by using the induction on to. □ 

The next lemma is similar to Lemma 2 in [FL89j . where the classical Sinkhorn 

iteration is considered. 

Lemma 5.8. Let r m ,c m be the vectors defined in Equation \5.2\5.3\) at iteration 

to and M — max (^) then, 

d(r m , 1) < (Pm+i -Pm)logM + (p m -p m _i)K(A (Pm) )logM 

+K(A^) K (A^-^)d(r m _ 1 ,l) 
d(c m , 1) < {p m -p m _i)logM + (p m -p m _i) K (A (p "- l) )logAf 

+K 2 (A^-^)d(c m ^,l) 

Proof. Let t/V indicates the entrywise inverse of a given vector, V . We have, 
r m = (W m o A^-p^)1 = {Z m _ x diag(l/c m ) o A^~^)l 

= {Z m _ x o A^-r^) diag(l/c m )l = (Z m _! o A^+i-P^)(l/c m ), 

so 

d(r m , 1)= d((Z m _t o A^+i-^)(l/c m ), Z m _!l) 

< (Pm+l - Prn) log M + K(Z m _i)d(c m , 1) 

= (p m+ i -p m ) log M + K(A (p ^)d(c m , 1). 

Also 

d(c mi l)= d((W m -i T o A^P^ T )(l/r m ^),W m ^ T l) 
< (p m -p m -i)logM + K(VK m _i T )rf(l/r m _ 1 , 1) 
= (p m -Pm-i)logM + /c(W m _i)d(r m _i,l) 
= (p m -p m _i)logAf + K(^ (Pm - l) )d(r m _i,l), 

then 

d(r mj l) < {p m +i-Pm)logM + (p m -p m - 1 )K(A ( - p ^)logM 
+ K (A^)K(A ( P^)d(r m ^,l) 

The second statement is established in a similar way. □ 

Lemma 5.9. Assume that p m = alog(m + l), where < alog6>(^4) < 2. Then we 
have linim-yoo d(c m , 1) — 0. 

Proof. Since 

d(c m , 1) = a log ^-tl log M + a log r H^l K (A^-^) log M 
m to 

+/s 3 (A<*»-0)d( Cro _ 1 , l) 

2a log ill o . ,«•„ s. ,, 
< — + K 2 (A^-^)d(c m ^, 1) . 
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Let Pi := d(c.\, 1), and define the sequence j3 m by j3 m :— / TO -i(/3 m -i), where 

. . 2a log M o. ,v 

f m -l(x) = — + K 2 (A (Pm ^ . 

m 

Since every function f m is nondecreasing, an immediate induction shows that d{c mi 1) < 
/?,„, for all m > 1, and so, it suffices to show that lim m (3 m = 0. 
Let l m be the fixed point of f m -i- Setting 

a log 9(A) 



and observing that 



we get 



4m -a 



(1 + rn- Q ) 2 



2a log M a log M (l + m 



m(l-K 2 (A(P™-i))) 



Since < a < 1, one readily checks that the sequence l m decreases with m and 
converges to zero. If /3 TO +i < l m for every m, then lin^^oo /3 m < lim TO ^oo l m = 0, 
and the result is established. Assume now that p m +\ > l m for some m. Define 
5k ■— /3fc+i — h for all k > m. Observe that 

4+i = fk(Pk) - fk(h) = n 2 (A^){p k - l k ) = n 2 {A^)8 k + K 2 (A (Pfc) )(/ fc _ 1 - l k ) . 
Using the fact that k 2 (A (?v) ) < 1 holds for all r, an immediate induction yields 

k-l 

S k < ( I] k2 (^ (M )M™ + « ro - fc, Vfc > m + 1 . (5.4) 

r—m 

Since 1 - k 2 (A^) ~ 4r~ Q , we have 

OO 

Y[ k(A^)=0 

r—m 

Letting k — > oo in (|5.4[) , we get limsupj.^^4 < l m . Since this holds for all m, it 
follows that limsup^oQ 5k < 0, and so limsup^^ Pk+i — nm su Pfc^oo 4 + ^fe — 
limsup^^ 5k + limfe-j.oo lk = 0. Hence, (3k converges to zero. □ 

The proof of the Theorem 15.41 is achieved since linim^oo d(c m , 1) = yields 
lim^oo d(r mi 1) = 

6. Conclusion. We considered the connection between the entropy maximiza- 
tion problem and the optimal assignment problem. This allows us to propose an 
algorithm which can be used as a preprocessing in the solution of large scale optimal 
assignment problems to reduce the size of the input problem in terms of memory 
requirements. 

Two variants of the algorithm have been implemented. The first variant, which 
is based on Sinkhorn iteration, shows a generally reasonable convergence for dense 
matrices, with a reduction of up to 99% of the input problem. However the algorithm 



A parallel preprocessing for the optimal assignment problem 



21 



works slowly for sparse matrices. This version of the algorithm can be efficiently used 
as a parallel preprocessing to reduce the size of the input problem in very large dense 
optimal assignment problems. 

Another variant of the algorithm, implemented by using the Newton iteration, 
shows fast convergence for all dense matrices and sparse symmetric matrices. However 
the convergence speed for sparse nonsymmetric matrices is slow. 

The last part of the paper concerns a new iterative method that we refer to as 
deformed- Sinkhorn iteration. It is proved that the iteration converges to the solution 
of optimal assignment problem, if the input matrix is positive and if it has only one 
optimal permutation. For positive matrices with more than one optimal permutation, 
the iteration converges to a matrix for which all the nonzero entries belong to at least 
one optimal permutation. 

Acknowledgment. The authors thank Jean-Charles Gilbert for his comments on 
an early version of this manuscript. 
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